02. As Good as the Data


Garbage in, Garbage out

"In computer science, garbage in, garbage out (GIGO) describes the concept that flawed, or nonsense input data produces nonsense output or 'garbage'." - source

Data Size

Data size is even more important when you use a deep learning algorithm than when you use many traditional machine learning techniques.

Deep learning models (neural networks) often need to see many examples of every category before they can learn to distinguish between classes and find general patterns in the data. If you have too few data points, or your data is not evenly distributed across the categories you want to distinguish, your predictions can end up with significant sampling bias. For example, a model might classify all data into one class, or it might learn patterns that are irrelevant to the task at hand.
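One quick way to spot an uneven distribution before training is to count how many examples each class has. The following is a minimal sketch in plain Python, not a prescribed method: the helper names `class_distribution` and `imbalance_ratio` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return largest class count divided by smallest (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy labels: 8 dog images and 2 wolf images.
labels = ["dog"] * 8 + ["wolf"] * 2

distribution = class_distribution(labels)
ratio = imbalance_ratio(labels)

print(distribution)  # {'dog': 0.8, 'wolf': 0.2}
print(ratio)         # 4.0
```

A ratio well above 1.0 suggests that one class dominates and that the imbalance may need to be addressed before training. What counts as "too high" depends on the task, so no fixed threshold is assumed here.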

Data Distribution and Pattern Detection:

  • Credit card fraud detection: most credit card transactions are valid, so these datasets often have thousands of valid examples and very few fraudulent ones. You need to take steps to account for this imbalance; otherwise, a model will likely learn to classify all new data as valid, since that is the most likely choice.
  • Wolf versus dog classification: you might think of building a classifier to distinguish wolves from dogs. If every wolf image has a snowy background, a machine learning model might mistakenly conflate snow with wolves, and you'll need more, varied data to create an accurate model.

Wolf/dog classifier learns to identify snow rather than the different animal features.
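The credit card fraud example can be made concrete with a small sketch. It assumes a hypothetical dataset of 990 valid and 10 fraudulent transactions and a "model" that always predicts the majority class. The code shows why accuracy alone is misleading on imbalanced data, and then computes inverse-frequency class weights, which is one common way (among several) to account for the imbalance. All function names are illustrative. The weighting formula `n_samples / (n_classes * class_count)` matches the "balanced" heuristic used by scikit-learn, but nothing here depends on that library.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of actual `positive` examples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    hits = sum(p == positive for p in preds_for_positives)
    return hits / len(preds_for_positives)


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Hypothetical dataset: 990 valid transactions and 10 fraudulent ones.
y_true = ["valid"] * 990 + ["fraud"] * 10

# A degenerate "model" that labels every transaction as valid.
y_pred = ["valid"] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
fraud_recall = recall(y_true, y_pred, "fraud")
weights = inverse_frequency_weights(y_true)

print(baseline_accuracy)  # 0.99
print(fraud_recall)       # 0.0
print(weights["fraud"])   # 50.0
```

The always-valid model scores 99% accuracy while catching none of the fraud, which is why a per-class metric such as recall is worth checking on imbalanced data. The resulting weights could be passed to a loss function so that mistakes on the rare class count for more. Resampling the data is another option.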
